Abstract

Our study revisits the problem of accuracy-fairness tradeoff in binary
classification. We argue that comparison of non-discriminatory
classifiers needs to account for different rates of positive
predictions, otherwise conclusions about performance may be misleading,
because accuracy and discrimination of naive baselines on the same
dataset vary with different rates of positive predictions. We provide
methodological recommendations for sound comparison of
non-discriminatory classifiers, and present a brief theoretical and
empirical analysis of tradeoffs between accuracy and non-discrimination.

1. Introduction 

Discrimination-aware machine learning is an emerging research area,
which studies how to make predictive models free from discrimination,
when historical data, on which they are built, may be biased,
incomplete, or even contain past discriminatory decisions. Research
assumes that the protected grounds, against which discrimination is
forbidden, are given by legislation. The goal for machine learning is to
develop algorithmic techniques for incorporating those
non-discriminatory constraints into predictive models.

A number of studies in discrimination-aware machine learning and data
mining (Pedreschi et al., 2009; Kamiran et al., 2010; Calders & Verwer,
2010) focus on achieving equal acceptance rates (proportions of positive
decisions) for favored and protected groups of individuals in binary
classification. Forcing acceptance rates to be equal without taking into
account other characteristics of individuals can be seen as an
affirmative action, which introduces positive discrimination promoting
the protected community. This may be desired for legal and political
reasons.

We revisit this popular scenario of discrimination-aware machine
learning, and identify some pitfalls to avoid when comparing the
performance of such classifiers; in particular, a comparison may be
misleading if the proportions of positive predictions of the classifiers
are different. We provide methodological recommendations for sound
comparison, and present a brief theoretical and empirical analysis of
tradeoffs between accuracy and non-discrimination.

The 2nd Workshop on Fairness, Accountability, and Transparency
in Machine Learning, Lille, France. Copyright 2015 by the author.

2. Problem setting and assumptions 

Given a dataset that contains discrimination, the goal is to build a
classifier that would be as accurate as possible, and obey
non-discrimination constraints. For example, a model could decide upon
granting a loan given demographic information and financial standing,
and considering ethnicity of an applicant (native, foreign) as the
protected ground. We assume that the values of the target variable
(labels) in the historical dataset are objectively correct, e.g. whether
the loan has been repaid or not. For discrimination to happen the target
variable needs to be polar, that is, one outcome (accept) should be
preferred over the other (reject).

Let $X$ denote a set of input variables (e.g. salary, assets), $s$
denote the protected characteristic (e.g. ethnicity: native ($w$) or
foreign ($b$)), and $y$ denote the target variable (e.g. loan decision:
accept ($+$) or reject ($-$)). A classifier maps $X$ to $y$, that is,
$y = f(X)$. Even though $s$ is not among the input variables, some
variables in $X$ may be correlated with $s$ (e.g. social security
payment history may be shorter for foreigners, because they have arrived
recently), and, as a result, classifier $f$ may capture the protected
characteristics, and induce indirect discrimination in decision making.

Let discrimination be measured as the difference in rates of
acceptance: $d = p(+|w) - p(+|b)$. Suppose that discrimination in the
historical dataset is $d_0 = \delta$, the desired discrimination in the
classifier output is $d^*$, the proportion of favored individuals in the
data is $p(w) = \alpha$, the prior probability of acceptance in the data
is $p(+) = \pi_0$, and the rate of acceptance in the classifier output
is $p_f(+) = \pi$.

Figure 1. Accuracy and discrimination measured directly.

Many classifiers produce probability scores (such as Naive Bayes or
logistic regression). Typically, a probability score can be computed for
non-probabilistic classifiers as well (such as kNN, SVM, decision
trees). Individuals scoring above a threshold, which by default is
typically 0.5, will get a positive decision. Considering available
resources, a decision maker can choose a different threshold. Suppose
that the objective is to keep discrimination at the desired level $d^*$
(typically zero), and at the same time maximize the prediction accuracy.
Effectively, by choosing the threshold a decision maker chooses the
acceptance rate $\pi$.
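To make the link between threshold and acceptance rate concrete, the following minimal Python sketch picks a threshold that yields a requested acceptance rate on a given set of scores. The function names are hypothetical, ties between scores are ignored, and scores at or above the threshold are accepted; none of these choices comes from the experiments reported here.

```python
import numpy as np

def threshold_for_rate(scores, pi):
    """Return a threshold that accepts roughly a fraction `pi` of `scores`.

    Assumption: individuals scoring at or above the threshold are accepted,
    and tied scores are not treated specially.
    """
    scores = np.asarray(scores, dtype=float)
    k = min(len(scores), int(round(pi * len(scores))))  # number to accept
    if k <= 0:
        return np.inf  # nobody is accepted
    # The k-th largest score: accepting everything >= it accepts k individuals.
    return np.sort(scores)[::-1][k - 1]

def acceptance_rate(scores, threshold):
    """Fraction of individuals whose score reaches the threshold."""
    return float(np.mean(np.asarray(scores, dtype=float) >= threshold))

# Ten evenly spaced scores; request an acceptance rate of 30%.
example_scores = np.linspace(0.05, 0.95, 10)
t = threshold_for_rate(example_scores, 0.3)
print(t, acceptance_rate(example_scores, t))
```

Sweeping `pi` from 0 to 1 in this way traces out the acceptance rates that a threshold sweep from 1 to 0 would visit.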

3. Accuracy and fairness 

The performance of discrimination-aware classifiers is typically
compared by plotting discrimination vs. accuracy. An attempt to remove
discrimination can easily produce classifiers with different acceptance
rates $\pi$ from those in the original dataset, especially when using
off-the-shelf classifier implementations (e.g. WEKA^1), which simply
round the numerical probability scores without any constraints on the
positive output rates.

Our main message is that evaluation of non-discriminatory classifiers
must take into account rates of acceptance, otherwise classifier
performance is not comparable, because changing the acceptance rate
changes baseline accuracy and baseline discrimination.


A small experiment with a benchmark dataset (Adult from the UCI^2
repository) illustrates the situation. The target variable describes
whether a person has high income or low. The protected characteristic
(gender) is not among the inputs. We randomly split the dataset into two
halves: training and testing. We train a logistic regression (similar
results have been obtained with Naive Bayes and decision tree J48) on
the training set, output class probability scores for the test set, and
vary the classification threshold from 0 to 1, which changes the
acceptance rate $\pi$. We also plot the accuracy of a random classifier
that does not use any inputs, but randomly decides upon the outcome
given the probability of acceptance $\pi$. Figure 1 presents the
results.

^1 http://www.cs.waikato.ac.nz/ml/weka/

^2 http://archive.ics.uci.edu/ml/

From the left plot we see that the more extreme the acceptance rate is
(either all reject, or all accept), the closer the performance of an
intelligent classifier (logistic regression) is to that of a random
classifier, which assigns labels at random. Therefore, better observed
accuracy does not necessarily mean better classification ability, if the
acceptance rates of the two classifiers are different. In order to be
able to compare such classifiers we could normalize the accuracy with
respect to $\pi$. Therefore, we suggest using for comparison a
normalized accuracy, such as Cohen's Kappa (Cohen, 1960), which
indicates by how much a classifier in question is better than a random
classifier:

$$\kappa = \frac{A - R}{1 - R}, \tag{1}$$

where $A$ is the accuracy of the classifier in question, and $R$ is the
accuracy of a random classifier, in our case
$R = \pi_0 \pi + (1 - \pi_0)(1 - \pi)$. Note that $\kappa \in [0, 1]$,
where 1 means the ideal accuracy, and 0 indicates a random result^3.

In the right plot we see how discrimination varies with different
acceptance rates. There is no discrimination if everybody is accepted,
or nobody is accepted, and the closer the acceptance rate $\pi$ gets to
these extremes, the smaller is $d$. This is not due to a better fairness
of the classifier, because the classifier is exactly the same, and its
output is the same, just the classification threshold varies. We would
like to assess the fairness of the classifier, therefore, similarly to
the accuracy, we need to normalize the result with respect to $\pi$.

We propose to normalize $d$ by the maximum possible $d_{\max}$ at each
$\pi$. Discrimination would be at its maximum if a classifier ranks
candidates in such a way that first everyone from the favored community
is accepted, and only then candidates from the protected community start
to be accepted^4. In such a case the maximum discrimination is

$$d_{\max} = \min\left(\frac{\pi}{\alpha}, \frac{1 - \pi}{1 - \alpha}\right), \tag{2}$$

where $\alpha$ is the proportion of the favored community individuals in
the data, and $\pi$ is the acceptance rate.

We propose to normalize the discrimination measure by the maximum
possible discrimination:

$$\delta = \frac{p(+|w) - p(+|b)}{d_{\max}}, \tag{3}$$

where $d_{\max}$, given in Eq. (2), is the maximum possible
discrimination at a given acceptance rate. The maximum value of $\delta$
is 1, which means the worst possible discrimination, where the favored
community has a complete priority; $\delta = 0$ means no discrimination,
where people from the favored and protected communities fully mix in the
queue. $\delta$ can be negative, indicating a reverse discrimination.

^3 One could consider other accuracy measures for imbalanced data, such
as F-score. We prefer Cohen's Kappa, since F-score does not behave
consistently at the extreme acceptance rates, and, therefore, is more
difficult to interpret. The F-score of a classifier that accepts
everybody depends on $\pi_0$, which varies from dataset to dataset,
while Kappa, by Eq. (1), always gives 0 in this case.

^4 It can be compared to a (supposedly fictional) evacuation procedure
from the Titanic. Passengers are put in a queue, where all the first
class passengers have a priority over third class passengers. Then as
many passengers are evacuated as there are boats.

Figure 2. Normalized accuracy and discrimination.

Figure 3. Oracle.

Figure 2 plots normalized accuracy $\kappa$ and normalized
discrimination $\delta$ of the logistic regression in our experiment. A
large part of the discrimination curve appears to be flat and closely in
line with the discrimination in the data. The results now make sense,
since the classifier in the experiment does not have any mechanisms for
discrimination removal. At the extreme ends, where everybody is
accepted, or everybody is rejected, intuitively, there is no
discrimination, and the normalized measure correctly shows no
discrimination.
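As an illustration, the raw and normalized measures of this section could be computed as in the following Python sketch. The function names are hypothetical, the code assumes $0 < \alpha < 1$, and returning 0 when a normalizer vanishes (for example when everybody is accepted) is an assumed convention rather than part of the definitions.

```python
import numpy as np

def raw_measures(y_true, y_pred, favored):
    """Return accuracy A, discrimination d = p(+|w) - p(+|b), and rate pi."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    favored = np.asarray(favored, dtype=bool)
    A = np.mean(y_true == y_pred)
    d = y_pred[favored].mean() - y_pred[~favored].mean()
    return float(A), float(d), float(y_pred.mean())

def kappa(A, pi, pi0):
    """Cohen's Kappa against a random classifier with acceptance rate pi."""
    R = pi0 * pi + (1 - pi0) * (1 - pi)
    return (A - R) / (1 - R) if R < 1 else 0.0  # assumed convention at R = 1

def d_max(pi, alpha):
    """Maximum possible discrimination at acceptance rate pi."""
    return min(pi / alpha, (1 - pi) / (1 - alpha))

def delta(d, pi, alpha):
    """Discrimination normalized by its maximum at acceptance rate pi."""
    m = d_max(pi, alpha)
    return d / m if m > 0 else 0.0  # assumed convention at pi = 0 or pi = 1

# Tiny made-up example: four individuals, two favored and two protected.
A, d, pi = raw_measures([1, 1, 0, 0], [1, 0, 1, 0], [1, 1, 0, 0])
print(A, d, pi, kappa(A, pi, pi0=0.5), delta(d, pi, alpha=0.5))
```

Evaluating `kappa` and `delta` at each threshold, rather than `A` and `d`, gives the kind of curves that Figure 2 reports.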

4. Baselines and tradeoffs 

It has been observed (Kamiran et al., 2010) that, assuming the labels in
data are correct, discrimination removal comes at a cost: it reduces
prediction accuracy. The authors found that, given no constraints on the
acceptance rates, the maximum possible accuracy decreases linearly as
the difference in rates of acceptance is reduced. We revisit the problem
of accuracy-fairness tradeoff to see if the normalized measures would
show similar relations.

An oracle is a fictional baseline classifier that has the maximum
possible intelligence (as if it knows the true labels), and strives to
satisfy non-discrimination constraints. A random classifier is the
opposite: it does not use any intelligence. For each individual a random
classifier makes a random prediction with the probability of acceptance
$\pi$.

The accuracy of the oracle will be $A_0 = 1$, kappa will be
$\kappa_0 = 1$, and the discrimination will be as in the data, $d_0$ and
$\delta_0$. The random classifier defines the other baseline of
performance with $A = \pi_0 \pi + (1 - \pi_0)(1 - \pi)$, $\kappa = 0$,
and $d = \delta = 0$. With $\pi = 0$ (or $\pi = 1$) the random
classifier turns into the majority class classifier.

Suppose a decision maker aims at removing all discrimination such that
$d^* = 0$ and $\delta^* = 0$. As suggested in (Kamiran et al., 2010),
the oracle would either reduce the acceptance rate for the favored
community (if $\alpha < 0.5$), or increase the acceptance rate for the
protected community (if $\alpha > 0.5$). The resulting decrease in
classification accuracy would be linearly proportional to the
discrimination in the data:
$(A_0 - A) = \min(\alpha, 1 - \alpha)(d_0 - d)$.

We find that if the rate of acceptance is to be fixed, that is
$\pi = \pi_0$, then the normalized accuracy of the oracle decreases
linearly with decrease in normalized discrimination:

$$(\kappa_0 - \kappa) = \min\left(\frac{\alpha}{\pi_0}, \frac{1 - \alpha}{1 - \pi_0}\right)(\delta_0 - \delta). \tag{4}$$
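The linear relation in Eq. (4) can be checked numerically on a toy population. The Python sketch below is only an illustration under stated assumptions: the group sizes and label counts are invented, the helper names are hypothetical, and the oracle is modelled as swapping equal numbers of labels between the two groups so that the acceptance rate stays at $\pi_0$.

```python
import numpy as np

def kappa(A, pi, pi0):
    """Cohen's Kappa against a random classifier with acceptance rate pi."""
    R = pi0 * pi + (1 - pi0) * (1 - pi)
    return (A - R) / (1 - R)

def d_max(pi, alpha):
    """Maximum possible discrimination at acceptance rate pi."""
    return min(pi / alpha, (1 - pi) / (1 - alpha))

def disc(y, favored):
    """Raw discrimination d = p(+|w) - p(+|b)."""
    return y[favored].mean() - y[~favored].mean()

def oracle_swap(y, favored, d_target):
    """Reject m favored positives and accept m protected negatives."""
    y = y.copy()
    n_w, n_b = favored.sum(), (~favored).sum()
    # Each swapped pair lowers d by 1/n_w + 1/n_b.
    m = int(round((disc(y, favored) - d_target) * n_w * n_b / (n_w + n_b)))
    # Which individuals are flipped does not matter for the oracle's accuracy.
    y[np.flatnonzero(favored & (y == 1))[:m]] = 0
    y[np.flatnonzero(~favored & (y == 0))[:m]] = 1
    return y

# Invented toy data: 600 favored (240 positive), 400 protected (40 positive).
favored = np.arange(1000) < 600
y_true = np.zeros(1000, dtype=int)
y_true[:240] = 1
y_true[600:640] = 1

alpha, pi0 = favored.mean(), y_true.mean()
y_pred = oracle_swap(y_true, favored, d_target=0.0)
A = (y_pred == y_true).mean()
d0, d = disc(y_true, favored), disc(y_pred, favored)

lhs = kappa(1.0, pi0, pi0) - kappa(A, y_pred.mean(), pi0)
rhs = min(alpha / pi0, (1 - alpha) / (1 - pi0)) * (d0 - d) / d_max(pi0, alpha)
print(lhs, rhs)
```

Both sides agree on this example because each swapped pair costs two errors while reducing $d$ by $1/(\alpha n) + 1/((1 - \alpha) n)$, where $n$ is the population size.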

If the rate of acceptance does not need to be fixed, the optimal
strategy is still the same: either to reduce acceptance for the favored
community ("decrease males"), or to increase acceptance for the
protected community ("increase females"), but the choice now depends not
only on $\alpha$, but also on $\pi_0$ and $\delta^*$. We do not have a
closed-form solution at the moment, but Figure 3 presents simulated
results of the oracle classifier on the benchmark dataset (Adult).
"Change both" is the solution where the acceptance rate is kept the same
as in the original data. These experiments show the maximum possible
accuracy, given the discrimination constraints. We can see that when
using the normalized measures for accuracy and discrimination the upper
bounds remain linear.

5. Interesting cases 

We wrap up our study with an experiment to illustrate the difference
between the raw and normalized measures when comparing
non-discriminatory classifiers.

Table 1. Performance of classifiers; all values are percentages.

                    p(+)     Acc.   Disc.   N. acc.     N. disc.
                    $\pi$    $A$    $d$     $\kappa$    $\delta$
Data/oracle         24.7     100    19.9    100         54.4
Logistic with s     20.2     84.9   18.3    56.7        61.4
Logistic no s       20.1     84.9   17.6    56.6        59.6
Logistic massage    22.1     83.5    6.9    53.9        21.3
NB with s           15.4     81.9   13.5    44.2        59.7
NB no s             14.4     81.4   10.9    41.7        51.3
NB massaged         15.4     81.5    6.8    43.3        29.7
Tree J48 with s     19.6     85.1   17.9    56.9        61.9
Tree J48 no s       19.6     85.0   17.9    56.7        61.8
Tree massage        22.9     83.5    6.1    54.6        18.1

The experiment compares the performance of three classifiers (logistic
regression, Naive Bayes and decision tree J48 from WEKA) trained using
three different strategies: including the protected characteristic among
classifier inputs, excluding the protected characteristic from
classifier inputs, and excluding the protected characteristic from
classifier inputs plus massaging the labels of the training data.
Massaging is perhaps the simplest discrimination removal strategy; it
was introduced in (Kamiran & Calders, 2009). Training labels are
converted from binary to numeric using a ranker function; we use a
logistic regression fit on the same training data. A number of lowest
ranked males who have a positive label are changed to negative, and the
same number of highest ranked females who have a negative label are
changed to positive, such that the positive rate remains the same as in
the original data, but the discrimination is zero. Then a classifier is
learned on this modified training data. Testing data is not modified.
Table 1 presents the results measured on the testing data.
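The massaging step described above can be sketched in Python as follows. This is a minimal illustration, not the implementation of (Kamiran & Calders, 2009) or the one used in our experiments: the function name is hypothetical, the ranker scores are assumed to be supplied by the caller, and the number of labels to flip is taken as the count that brings the difference in positive rates closest to zero.

```python
import numpy as np

def massage_labels(y, favored, scores):
    """Flip training labels so that both groups get equal positive rates.

    y       : binary training labels (1 = accept)
    favored : boolean mask, True for the favored group
    scores  : ranker scores, higher means more likely positive (assumed given)
    """
    y = np.asarray(y).astype(int).copy()
    favored = np.asarray(favored, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    n_w, n_b = favored.sum(), (~favored).sum()
    d0 = y[favored].mean() - y[~favored].mean()
    # Each flipped pair lowers the discrimination by 1/n_w + 1/n_b.
    m = int(round(d0 * n_w * n_b / (n_w + n_b)))
    if m <= 0:
        return y  # nothing to remove (or reverse discrimination)
    demote = np.flatnonzero(favored & (y == 1))
    promote = np.flatnonzero(~favored & (y == 0))
    # Lowest ranked favored positives become negative ...
    y[demote[np.argsort(scores[demote])][:m]] = 0
    # ... and highest ranked protected negatives become positive.
    y[promote[np.argsort(-scores[promote])][:m]] = 1
    return y

# Made-up example: four favored and four protected individuals.
fav = [1, 1, 1, 1, 0, 0, 0, 0]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
ranks = [0.9, 0.8, 0.6, 0.3, 0.7, 0.5, 0.4, 0.1]
massaged = massage_labels(labels, fav, ranks)
print(massaged.tolist())
```

Because the same number of labels is flipped in each direction, the overall positive rate of the training data is unchanged; only its distribution between the groups changes.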

We can make several interesting observations. First, all classifiers
tend to output lower acceptance rates than that in the original data. At
the same time, if the protected characteristic is used, the
discrimination measure $d$ may show a decrease in the nominal
discrimination as compared to the original data, but the normalized
discrimination $\delta$ of all three classifiers is even higher than in
the data. Apparently, a classifier learned on discriminatory data
without any protective measures amplifies discrimination.

Removing the protected characteristic (no s) yields little improvement
in discrimination. This is due to the so-called redlining effect. A
number of features in the data are correlated with the protected
characteristic, therefore, discrimination is still captured, and, in
cases of logistic regression and decision tree, is still higher than in
the original dataset.

Interestingly, the massaging strategy outputs higher acceptance rates
than removing the protected characteristic. The acceptance rates of
massaging are closer to the positive rates in the original data, and
discrimination is lower, as expected. This suggests that when
discrimination is present in the training data, but usage of the
protected characteristic is not allowed, classifiers tend to decrease
the acceptance rate, which may show better nominal discrimination
figures, but the real underlying discrimination (measured by normalized
$\delta$) remains.

Finally, Figure 4 presents normalized accuracies and discriminations at
different acceptance rates. Overall we can see that massaging does
remove some of the discrimination, but at many acceptance rates the
removal is not very precise, and sometimes even overshoots, introducing
a reverse discrimination. This calls for a revision of the massaging,
and possibly other discrimination removal techniques, taking into
consideration the possibility of different acceptance rates and
normalized measures of discrimination.

Figure 4. Performance of baseline classifiers.


6. Conclusion 

Evaluation of non-discriminatory classifiers needs to take into account
positive output rates, otherwise the comparison may be misleading and
conclusions about comparative performance may be invalid.

We have introduced a normalization factor for the discrimination
measure, considering the maximum possible discrimination at a given
acceptance rate. The maximum discrimination is present when the
protected individuals start to be accepted only after everybody from the
favored community is accepted.

Acceptance rates may be constrained by resources, and may not be freely
chosen by decision makers. If the acceptance rate in the data and in the
classifier outputs is fixed, then classifiers are comparable in terms of
$A$ and $d$, otherwise they need to be compared in terms of $\kappa$ and
$\delta$.